{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# COMPSCI 389: Introduction to Machine Learning\n",
    "# Topic 10.3 Automatic Differentiation for ML\n",
    "\n",
    "In this notebook we show how automatic differentiation can be used for ML by running gradient descent on the sample MSE for a linear parametric model fit to the GPA data set.\n",
    "\n",
    "First, here are the import statements we will use. We will use a train-test split and standardization. We will also use `shuffle` from `sklearn.utils` to shuffle data points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import autograd.numpy as np\n",
    "from autograd import grad\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split \n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.utils import shuffle"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's load the GPA data set, split it into inputs `X` and labels `y`. Unlike before, we will make these `ndarray` objects from `autograd.numpy` by calling `.values` after `.iloc[...]`. Remember that you can load the GPA data set directly from online (the upper line), or from a local download (the commented out lower line)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv\", delimiter=',') # Read GPA.csv, assuming numbers are separated by commas\n",
    "# df = pd.read_csv(\"data/GPA.csv\", delimiter=',')\n",
    "\n",
    "# Split into features and labels\n",
    "X = df.iloc[:, :-1].values\n",
    "y = df.iloc[:, -1].values.reshape(-1, 1)  # Reshape for vectorized operations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's split into training and testing sets, using $80\\%$ of the data for training and $20\\%$ for testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's use `StandardScaler` to pre-process the inputs. Notice that we use `fit_transform` on the training data to find the necessary values (in this case the mean and standard deviation). At test time, we then want to use these same values, since our model was trained under the assumption that the data would be pre-processed using these values. Hence, we use `transform` rather than `fit_transform` when pre-processing the testing data.\n",
    "\n",
    "While it can be reasonable to run `fit_transform` once on all of the data, this would result in the model being (partially) computed from the testing data. The approach used below of calling `fit_transform` on just the training data ensures that the testing data does not influence the learned model in any way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Standardize the features\n",
    "scaler = StandardScaler()\n",
    "X_train = scaler.fit_transform(X_train) # This sets the min/max values from the training data (without looking at the testing)\n",
    "X_test = scaler.transform(X_test)       # This uses the min/max scaling values chosen during training! (transform, not fit_transform)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, since we won't be using a basis, let's append a column of ones to the `X_train` and `X_test` numpy arrays:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = np.c_[np.ones(X_train.shape[0]), X_train]\n",
    "X_test = np.c_[np.ones(X_test.shape[0]), X_test]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's define our linear parametric model. This implements:\n",
    "\n",
    "$$\n",
    "f_w(x_i) = \\sum_{j=1}^d w_j x_{i,j}.\n",
    "$$\n",
    "\n",
    "Or, using dot-product notation,\n",
    "\n",
    "$$\n",
    "f_w(x_i) = w \\cdot x_i.\n",
    "$$\n",
    "\n",
    "For efficiency, our code will use this dot-product approach. Inside of numpy, this dot product will be computed using a loop over weights, but this loop will be implemented in a more efficient language (likely C or C++) rather than Python."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def linear_model(X, weights):\n",
    "    return np.dot(X, weights)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's implement our loss function, which computes teh sample MSE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def loss_function(weights, X, y):\n",
    "    predictions = linear_model(X, weights)\n",
    "    return np.mean((predictions - y)**2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's use autograd to get the gradient of the loss function. We want the gradient with respect to `weights` not `X` or `y`. Remember that the `grad` function defaults to providing the derivative with respect to the first input, which is the desired behavior here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "grad_loss = grad(loss_function) # Defaults to grad(loss_function, 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, lets select the initial weight vector. Common strategies are to use all-zeros or to use random values (from some distribution, e.g., a normal distribution)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_weights = X_train.shape[1]\n",
    "#weights = np.zeros((num_weights, 1))        # Start with all weights being zero\n",
    "weights = np.random.randn(num_weights, 1)   # Sample (numWeights x 1) values from a standard normal distribution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's run 50 iterations of gradient descent with a step size (learning rate) of $0.05$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Iteration 0, Loss: 5.260402217978024\n",
      "Iteration 10, Loss: 1.2519262164232552\n",
      "Iteration 20, Loss: 0.7295958047115758\n",
      "Iteration 30, Loss: 0.627606681129744\n",
      "Iteration 40, Loss: 0.5992668395272726\n",
      "Test MSE: 0.5932749622703825\n"
     ]
    }
   ],
   "source": [
    "num_iterations = 50\n",
    "learning_rate = 0.05\n",
    "\n",
    "# Training loop\n",
    "for iteration in range(num_iterations):\n",
    "    weights -= learning_rate * grad_loss(weights, X_train, y_train)\n",
    "\n",
    "    # Print loss every 10 iterations\n",
    "    if iteration % 10 == 0:\n",
    "        current_loss = loss_function(weights, X_train, y_train)\n",
    "        print(f\"Iteration {iteration}, Loss: {current_loss}\")\n",
    "\n",
    "# Evaluate on test data\n",
    "test_loss = loss_function(weights, X_test, y_test)\n",
    "print(f\"Test MSE: {test_loss}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember when we worked out the derivatives necessary to do exactly this? That process was slow and error-prone. Using automatic differentiation techniques made this much easier. All you have to do is define your loss function and model, and autograd takes care of all of the derivatives for you!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Epochs and Mini-Batches\n",
    "\n",
    "When the data set is large, computing the sample MSE (or gradient of the sample MSE) for the entire training set can take a very long time.\n",
    "\n",
    "**Idea**: Split the training data into **mini-batches**.\n",
    "- Each mini-batch is a collection of several rows (training points).\n",
    "- Each iteration of gradient descent can use a different mini-batch.\n",
    "- The process of running gradient descent on all mini-batches one time is called an **epoch**.\n",
    "    - Hence, each epoch corresponds to one pass over the entire data set, performing one gradient update for each mini-batch.\n",
    "- Training typically involves running several epochs.\n",
    "- Different splits of the data into mini-batches are typically used for each epoch.\n",
    "- We typically define the size of each mini-batch, not the number of mini-batches."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is our code, updated to include mini-batches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 0, Loss: 0.5832311803176145\n",
      "Epoch 10, Loss: 0.584201610532094\n",
      "Epoch 20, Loss: 0.5832333795591383\n",
      "Epoch 30, Loss: 0.5836298546523091\n",
      "Epoch 40, Loss: 0.5840496234285119\n",
      "Test MSE: 0.5892578927313206\n"
     ]
    }
   ],
   "source": [
    "num_epochs = 50\n",
    "learning_rate = 0.05\n",
    "minibatch_size = 100\n",
    "\n",
    "for epoch in range(num_epochs):\n",
    "    # Shuffle the training data\n",
    "    X_train_shuffled, y_train_shuffled = shuffle(X_train, y_train)\n",
    "\n",
    "    # Loop over mini-batches\n",
    "    for i in range(0, X_train.shape[0], minibatch_size):\n",
    "        end = min(i + minibatch_size, X_train_shuffled.shape[0]) # The last mini-batch may be smaller than the others\n",
    "        X_batch = X_train_shuffled[i:end]\n",
    "        y_batch = y_train_shuffled[i:end]\n",
    "        \n",
    "        gradients = grad_loss(weights, X_batch, y_batch)\n",
    "        weights -= learning_rate * gradients\n",
    "\n",
    "    # Print loss every 10 epochs\n",
    "    if epoch % 10 == 0:\n",
    "        current_loss = loss_function(weights, X_train, y_train)\n",
    "        print(f\"Epoch {epoch}, Loss: {current_loss}\")\n",
    "\n",
    "# Evaluate on test data\n",
    "test_loss = loss_function(weights, X_test, y_test)\n",
    "print(f\"Test MSE: {test_loss}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the sample MSE reached lower values in fewer epochs! This is often the case - mini-batches not only make the amount of data used for each gradient computation more manageable, they often speed up the optimization process. The full reasoning for this is beyond the scope of the course, but is something we may discuss briefly in lecture."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}